I’ll post the script without suppressing any warning or error messages. This will make the output more verbose, but hopefully also more transparent.

Before we start, we have to load the packages necessary for reading in and manipulating the data. The tidyverse package collection provides a variety of tools from packages like dplyr, tidyr, purrr and ggplot2; readxl provides a robust method for reading directly from Excel files (.xls and .xlsx); and lubridate provides intuitive tools for handling dates and times. To access the API we need functions from the httr package, and eventually we need to fit generalized additive models (GAMs) with the mgcv package. For database-related work, the DBI package provides the necessary interface, and RSQLite the backend functionality. Finally, sf helps in the graphical display of the measurement sites and plotly makes the graphics interactive.

Make sure to follow these instructions before installing sf.

The development versions of ggplot2 and plotly from GitHub are necessary to visualize sf objects (interactively):

# install.packages(c("devtools", "RSQLite", "sf", "tidyverse"))
# devtools::install_github("tidyverse/ggplot2")
# devtools::install_github("ropensci/plotly")

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
## ✔ tibble  1.4.2          ✔ dplyr   0.7.4     
## ✔ tidyr   0.8.0          ✔ stringr 1.3.1     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::vars()   masks ggplot2::vars()
library(readxl)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(httr)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
library(DBI)
library(RSQLite)
library(sf)
## Linking to GEOS 3.6.2, GDAL 2.2.4, proj.4 5.0.1

1 Reading in the data

First we need to read in the Excel sheet, which is inside a Data folder within my RStudio project. We take a direct look at the raw data for a first sanity check.

data_raw <- read_excel("../Data/tidal_sample.xlsx")

data_raw
## # A tibble: 250 x 8
##    Date       Time  Substrattype `GPS Latitude` `GPS Longitude` depth
##    <chr>      <chr> <chr>        <chr>          <chr>           <chr>
##  1 03.04.2017 14:06 Grus-Stein   63,0902        7,26505         31,3 
##  2 03.04.2017 14:23 Sand         63,10097       7,29703         36,9 
##  3 11.04.2017 15:25 Grus         62,50683       5,9377          31,6 
##  4 11.04.2017 15:37 Skjellsand   62,51823       5,93911         25,3 
##  5 11.04.2017 15:55 Skjellsand   62,52615       5,922           32,7 
##  6 12.04.2017 12:05 Stein        62,56414       5,88093         43,3 
##  7 23.06.2017 13:14 Stein        62,35022       5,42584         37,4 
##  8 06.04.2017 16:00 Sand-Stein   63,47915       8,1177          31,3 
##  9 06.04.2017 15:43 Fjell        63,48384       8,12161         33,1 
## 10 06.04.2017 16:17 Fjell-Stein  63,48167       8,07677         32,8 
## # ... with 240 more rows, and 2 more variables: `Kommentar til observasjon
## #   (evt fremmede arter)` <chr>, `Anslått stortaresubstrat (%)` <chr>

There are a couple of issues with the way the data is recognized by R. The GPS Latitude, GPS Longitude and depth columns were read in as characters, and the decimal sign remains a comma instead of a point. The column names mix English and Norwegian titles, and the columns for the GPS coordinates have a space in their names, which could cause problems downstream. Time for some tidying of the data!

1.1 Tidying up the data

First, we rename() some of the columns to make their names uniform. In the next step, we mutate() the GPS coordinates and depth values, swapping the comma for a point using str_replace() and turning the columns into numeric columns (as.double()). As the kelp percentage column does not contain decimals, we can convert it to numeric directly. At the same time, we turn the Date column into an actual date column using the dmy() (for day month year) function.

data <- data_raw %>%
  rename(Substrate_Type = Substrattype, 
         GPS_Latitude = `GPS Latitude`, 
         GPS_Longitude = `GPS Longitude`, 
         Depth = depth,
         Comment = `Kommentar til observasjon (evt fremmede arter)`, 
         Kelp_Percentage = `Anslått stortaresubstrat (%)`) %>% 
  mutate(Date = dmy(Date), 
         GPS_Latitude = str_replace(string = GPS_Latitude, pattern = ",", replacement = "."),
         GPS_Latitude = as.double(GPS_Latitude),
         GPS_Longitude = str_replace(string = GPS_Longitude, pattern = ",", replacement = "."),
         GPS_Longitude = as.double(GPS_Longitude),
         Depth = str_replace(string = Depth, pattern = ",", replacement = "."),
         Depth = as.double(Depth),
         Kelp_Percentage = as.double(Kelp_Percentage))
## Warning: 2 failed to parse.
## Warning in evalq(as.double(GPS_Latitude), <environment>): NAs introduced by
## coercion
## Warning in evalq(as.double(GPS_Longitude), <environment>): NAs introduced
## by coercion
## Warning in evalq(as.double(Depth), <environment>): NAs introduced by
## coercion
## Warning in evalq(as.double(Kelp_Percentage), <environment>): NAs introduced
## by coercion
data
## # A tibble: 250 x 8
##    Date       Time  Substrate_Type GPS_Latitude GPS_Longitude Depth
##    <date>     <chr> <chr>                 <dbl>         <dbl> <dbl>
##  1 2017-04-03 14:06 Grus-Stein             63.1          7.27  31.3
##  2 2017-04-03 14:23 Sand                   63.1          7.30  36.9
##  3 2017-04-11 15:25 Grus                   62.5          5.94  31.6
##  4 2017-04-11 15:37 Skjellsand             62.5          5.94  25.3
##  5 2017-04-11 15:55 Skjellsand             62.5          5.92  32.7
##  6 2017-04-12 12:05 Stein                  62.6          5.88  43.3
##  7 2017-06-23 13:14 Stein                  62.4          5.43  37.4
##  8 2017-04-06 16:00 Sand-Stein             63.5          8.12  31.3
##  9 2017-04-06 15:43 Fjell                  63.5          8.12  33.1
## 10 2017-04-06 16:17 Fjell-Stein            63.5          8.08  32.8
## # ... with 240 more rows, and 2 more variables: Comment <chr>,
## #   Kelp_Percentage <dbl>

The tidying was successful and the data looks good. The warning messages likely occurred because of missing values in the GPS coordinate and depth columns. For another sanity check, we take a look at the summary() output.

data %>% summary()
##       Date                Time           Substrate_Type    
##  Min.   :2017-04-03   Length:250         Length:250        
##  1st Qu.:2017-04-05   Class :character   Class :character  
##  Median :2017-04-10   Mode  :character   Mode  :character  
##  Mean   :2017-04-19                                        
##  3rd Qu.:2017-04-25                                        
##  Max.   :2017-06-23                                        
##  NA's   :2                                                 
##   GPS_Latitude   GPS_Longitude        Depth         Comment         
##  Min.   : 7.23   Min.   : 5.264   Min.   : 2.70   Length:250        
##  1st Qu.:62.23   1st Qu.: 5.446   1st Qu.:10.00   Class :character  
##  Median :62.52   Median : 6.091   Median :18.20   Mode  :character  
##  Mean   :62.44   Mean   : 6.789   Mean   :19.73                     
##  3rd Qu.:63.05   3rd Qu.: 7.738   3rd Qu.:29.50                     
##  Max.   :63.50   Max.   :63.106   Max.   :43.30                     
##  NA's   :1       NA's   :1        NA's   :1                         
##  Kelp_Percentage 
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median : 10.00  
##  Mean   : 42.31  
##  3rd Qu.:100.00  
##  Max.   :100.00  
##  NA's   :7
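The NA counts hint that a few raw entries could not be coerced to numbers. A quick way to pinpoint such entries (a sketch using the raw depth column; the same pattern works for the other character columns) is to compare each raw value against its coerced version:

```r
# Sketch: list raw depth entries that fail numeric conversion even after
# swapping the comma for a point; truly empty cells are excluded.
data_raw %>%
  mutate(Depth_num = as.double(str_replace(depth, ",", "."))) %>%
  filter(is.na(Depth_num) & !is.na(depth)) %>%
  select(Date, Time, depth)
```

If this returns zero rows, the NAs stem from genuinely empty cells rather than malformed values.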

Multiple issues here. Two observations without a Date seem dubious, as do the extreme values for the GPS coordinates; likely, someone mixed up longitude and latitude in the report. Let’s examine the raw date values as well as the tidied coordinates for more insight.

data_raw %>% 
  select(Date) %>% 
  unique()
## # A tibble: 15 x 1
##    Date      
##    <chr>     
##  1 03.04.2017
##  2 11.04.2017
##  3 12.04.2017
##  4 23.06.2017
##  5 06.04.2017
##  6 04.04.2017
##  7 22.06.2017
##  8 05.04.2017
##  9 09.04.2017
## 10 10.04.2017
## 11 25.04.2017
## 12 24.04.2017
## 13 26.04.2017
## 14 2017-06-22
## 15 <Null>
data %>% 
  select(GPS_Longitude) %>% 
  arrange(desc(GPS_Longitude))
## # A tibble: 250 x 1
##    GPS_Longitude
##            <dbl>
##  1         63.1 
##  2          8.25
##  3          8.24
##  4          8.23
##  5          8.22
##  6          8.22
##  7          8.22
##  8          8.22
##  9          8.21
## 10          8.21
## # ... with 240 more rows
data %>% 
  select(GPS_Latitude) %>% 
  arrange(GPS_Latitude)
## # A tibble: 250 x 1
##    GPS_Latitude
##           <dbl>
##  1         7.23
##  2        62.0 
##  3        62.0 
##  4        62.0 
##  5        62.1 
##  6        62.1 
##  7        62.1 
##  8        62.1 
##  9        62.1 
## 10        62.1 
## # ... with 240 more rows

The <Null> value for the date seems like a “real” missing value, but one date appears to have been entered in a different format. This needs to be adjusted. As suspected, one observation seems to have the longitude and latitude information swapped.

First, we take care of the differently formatted date value. The easiest way is to repeat the original tidying of the data set, but specifying the Date conversion differently. This time we use parse_date_time() to explicitly state both formats in which the dates were entered, and set the time zone. In addition, we include an additional mutate() to take care of the GPS coordinate swap. For lack of a more elegant solution, this works by creating dummy columns based on conditions defined inside case_when(). These dummy columns are removed in the final select() step.

data <- data_raw %>%
  rename(Substrate_Type = Substrattype, 
         GPS_Latitude = `GPS Latitude`, 
         GPS_Longitude = `GPS Longitude`, 
         Depth = depth,
         Comment = `Kommentar til observasjon (evt fremmede arter)`, 
         Kelp_Percentage = `Anslått stortaresubstrat (%)`) %>% 
  mutate(Date = parse_date_time(Date, orders = c("d.m.Y", "Y-m-d"), tz = "Europe/Oslo"), 
         GPS_Latitude = str_replace(string = GPS_Latitude, pattern = ",", replacement = "."),
         GPS_Latitude = as.double(GPS_Latitude),
         GPS_Longitude = str_replace(string = GPS_Longitude, pattern = ",", replacement = "."),
         GPS_Longitude = as.double(GPS_Longitude),
         Depth = str_replace(string = Depth, pattern = ",", replacement = "."),
         Depth = as.double(Depth),
         Kelp_Percentage = as.double(Kelp_Percentage)) %>% 
  mutate(GPS_Latitude2 = case_when(GPS_Latitude < 10 ~ GPS_Longitude, 
                                   GPS_Latitude > 10 ~ GPS_Latitude), 
         GPS_Longitude2 = case_when(GPS_Longitude > 60 ~ GPS_Latitude, 
                                    GPS_Longitude < 60 ~ GPS_Longitude), 
         GPS_Latitude = GPS_Latitude2, 
         GPS_Longitude = GPS_Longitude2) %>% 
  select(-GPS_Latitude2, -GPS_Longitude2)
## Warning: 1 failed to parse.
## Warning in evalq(as.double(GPS_Latitude), <environment>): NAs introduced by
## coercion
## Warning in evalq(as.double(GPS_Longitude), <environment>): NAs introduced
## by coercion
## Warning in evalq(as.double(Depth), <environment>): NAs introduced by
## coercion
## Warning in evalq(as.double(Kelp_Percentage), <environment>): NAs introduced
## by coercion
data
## # A tibble: 250 x 8
##    Date                Time  Substrate_Type GPS_Latitude GPS_Longitude
##    <dttm>              <chr> <chr>                 <dbl>         <dbl>
##  1 2017-04-03 00:00:00 14:06 Grus-Stein             63.1          7.27
##  2 2017-04-03 00:00:00 14:23 Sand                   63.1          7.30
##  3 2017-04-11 00:00:00 15:25 Grus                   62.5          5.94
##  4 2017-04-11 00:00:00 15:37 Skjellsand             62.5          5.94
##  5 2017-04-11 00:00:00 15:55 Skjellsand             62.5          5.92
##  6 2017-04-12 00:00:00 12:05 Stein                  62.6          5.88
##  7 2017-06-23 00:00:00 13:14 Stein                  62.4          5.43
##  8 2017-04-06 00:00:00 16:00 Sand-Stein             63.5          8.12
##  9 2017-04-06 00:00:00 15:43 Fjell                  63.5          8.12
## 10 2017-04-06 00:00:00 16:17 Fjell-Stein            63.5          8.08
## # ... with 240 more rows, and 3 more variables: Depth <dbl>,
## #   Comment <chr>, Kelp_Percentage <dbl>
data %>% 
  summary()
##       Date                         Time           Substrate_Type    
##  Min.   :2017-04-03 00:00:00   Length:250         Length:250        
##  1st Qu.:2017-04-05 00:00:00   Class :character   Class :character  
##  Median :2017-04-10 00:00:00   Mode  :character   Mode  :character  
##  Mean   :2017-04-19 11:22:24                                        
##  3rd Qu.:2017-04-25 00:00:00                                        
##  Max.   :2017-06-23 00:00:00                                        
##  NA's   :1                                                          
##   GPS_Latitude   GPS_Longitude       Depth         Comment         
##  Min.   :62.04   Min.   :5.264   Min.   : 2.70   Length:250        
##  1st Qu.:62.23   1st Qu.:5.446   1st Qu.:10.00   Class :character  
##  Median :62.52   Median :6.091   Median :18.20   Mode  :character  
##  Mean   :62.66   Mean   :6.565   Mean   :19.73                     
##  3rd Qu.:63.06   3rd Qu.:7.724   3rd Qu.:29.50                     
##  Max.   :63.50   Max.   :8.248   Max.   :43.30                     
##  NA's   :1       NA's   :1       NA's   :1                         
##  Kelp_Percentage 
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median : 10.00  
##  Mean   : 42.31  
##  3rd Qu.:100.00  
##  Max.   :100.00  
##  NA's   :7

It looks like the issues have been fixed and we can proceed further. The final step in tidying up is to bring all the necessary columns into an adequate form for use with the API in the next step. This includes re-transforming Date into a character column, as well as dropping observations (drop_na()) with missing data in any of Date, Time, GPS_Longitude, GPS_Latitude, or Depth.

data_tidy <- data %>%
  drop_na(Date, Time, GPS_Latitude, GPS_Longitude, Depth) %>%
  mutate(Date = as.character(Date))

data_tidy
## # A tibble: 249 x 8
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 1 more variable: Kelp_Percentage <dbl>

This looks good and we can continue with setting everything up for the API connection!

2 Connect to the API and retrieve the response

In order to retrieve the information for every observation, we need to construct an individual API call for each observation. We’ll do this by creating dummy columns and later assembling the final URL for the API call by pasting the columns together. Each dummy column thus represents a parameter of the API, as specified in the user instructions. The precise API we use is Water level data in position; the parameter values were chosen according to its user instructions.

The URL is then constructed from these variables for each observation.

data_API <- data_tidy %>%
  mutate(API = "http://api.sehavniva.no/tideapi.php?tide_request=locationdata", 
         lat = GPS_Latitude,
         lon = GPS_Longitude,
         datatype = "OBS",
         file = "NSKV",
         lang = "en",
         dst = 1,
         refcode = "CD",
         fromtime = str_c(Date, "T", "00:00"),
         totime = str_c(Date, "T", "23:59"),
         interval = 10) %>%
  mutate(URL = str_c(API, "&lat=", lat, "&lon=", lon, "&datatype=", datatype,
                     "&file=", file, "&lang=", lang, "&dst=", dst,
                     "&refcode=", refcode, "&fromtime=", fromtime,
                     "&totime=", totime,
                     "&interval=", interval))

data_API
## # A tibble: 249 x 20
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 13 more variables: Kelp_Percentage <dbl>,
## #   API <chr>, lat <dbl>, lon <dbl>, datatype <chr>, file <chr>,
## #   lang <chr>, dst <dbl>, refcode <chr>, fromtime <chr>, totime <chr>,
## #   interval <dbl>, URL <chr>

This looks good (at least there were no errors). Now it’s time to call the API, using GET().

data_API <- data_API %>% 
  mutate(API_GET = map(URL, ~ GET(.)))

data_API
## # A tibble: 249 x 21
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 14 more variables: Kelp_Percentage <dbl>,
## #   API <chr>, lat <dbl>, lon <dbl>, datatype <chr>, file <chr>,
## #   lang <chr>, dst <dbl>, refcode <chr>, fromtime <chr>, totime <chr>,
## #   interval <dbl>, URL <chr>, API_GET <list>

It worked! Now let’s access the information and use it to correct the depth measurements.

2.1 Use the API data to model the chart datum

To access the retrieved data, it’s easiest to specify a custom function to call for each observation in the data set. The following function removes some of the metadata and cleans up the returned table considerably.

read_the_API_data <- function(df) {
  df2 <- content(df) %>% 
    read_delim(delim = "\r\n") %>% 
    filter(row_number() >= 10)
  colnames(df2) <- "Date"
  df3 <- df2 %>% 
    filter(str_detect(string = Date, pattern = "_") == FALSE) %>% 
    separate(Date, into = c("Date", "Tide"), sep = "  ") %>% 
    mutate(Hour = str_sub(string = Date, start = 12, end = 13), 
           Minute = str_sub(string = Date, start = 14, end = 15)) %>% 
    unite(Time, Hour, Minute, sep = ":") %>% 
    mutate(Date = ymd_hm(Date), 
           Time = hm(Time),
           Time = (as.double(Time) / 60) / 60,
           Tide = as.double(Tide))
  return(df3)
}

Let’s try it out using map().

data_API <- data_API %>%
  mutate(API_Data = map(API_GET, ~ read_the_API_data(.)))

data_API
## # A tibble: 249 x 22
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 15 more variables: Kelp_Percentage <dbl>,
## #   API <chr>, lat <dbl>, lon <dbl>, datatype <chr>, file <chr>,
## #   lang <chr>, dst <dbl>, refcode <chr>, fromtime <chr>, totime <chr>,
## #   interval <dbl>, URL <chr>, API_GET <list>, API_Data <list>

No errors, no warnings, we’re happy for the time being. We now have a temporal data set of the day’s water levels for each observation. To get an approximately correct value with which to correct the measured depth, we have to create a function based on the data retrieved from the API. For tidal data like this (going up and down over time), it makes sense to use a “simple” generalized additive model (GAM), which can handle this sort of pattern. Since we use this GAM for predictions only, and not (!) to draw statistical conclusions, we can accept a certain amount of overfitting to get a good predicted value for the water level at the time of measurement. But first, we need to specify a convenience function to create a GAM for each observational element.

GAM_from_API_Data <- function(df) {
  gam(formula = Tide ~ s(Time, k = 24), data = df)
}
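As a quick plausibility check of this formula (a sketch on synthetic data, not on the API response), we can confirm that a smooth with k = 24 is flexible enough to follow a tide-like curve sampled every 10 minutes over a day:

```r
# Sketch: fit the same smooth to a synthetic semidiurnal tide
# (period ~12.42 h, i.e. roughly two highs and two lows per day).
fake_tide <- tibble(Time = seq(0, 24, by = 1/6),
                    Tide = 100 * sin(2 * pi * Time / 12.42))
fake_fit <- gam(Tide ~ s(Time, k = 24), data = fake_tide)
summary(fake_fit)$r.sq  # adjusted R-squared; close to 1 for this smooth signal
```

The smooth tracks the synthetic tide almost perfectly, which suggests k = 24 is generous enough for the real 10-minute series.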

Let’s put it in action and see what we get. Note that we wrap the function in possibly() so that missing values for some of the observations do not break the entire calculation.

data_API <- data_API %>%
  mutate(API_GAM = map(API_Data, possibly(~ GAM_from_API_Data(.), otherwise = NA))) 
data_API
## # A tibble: 249 x 23
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 16 more variables: Kelp_Percentage <dbl>,
## #   API <chr>, lat <dbl>, lon <dbl>, datatype <chr>, file <chr>,
## #   lang <chr>, dst <dbl>, refcode <chr>, fromtime <chr>, totime <chr>,
## #   interval <dbl>, URL <chr>, API_GET <list>, API_Data <list>,
## #   API_GAM <list>

No complaints, and it seems like reasonable data could be retrieved for every observation (no errors or warnings). We now have an individual water level GAM for each observation. Let’s use these models to predict the water level at the time the water depth was measured. Again we use possibly() as a safety measure.

data_API <- data_API %>% 
  mutate(Water_Level = map2_dbl(API_GAM, (as.double(hm(Time)) / 60) / 60, 
                                possibly(~ predict(.x, newdata = tibble(Time = .y)) / 100, 
                                       otherwise = NA)), 
         Corrected_Depth = Depth - Water_Level)

data_API
## # A tibble: 249 x 25
##    Date   Time  Substrate_Type GPS_Latitude GPS_Longitude Depth Comment   
##    <chr>  <chr> <chr>                 <dbl>         <dbl> <dbl> <chr>     
##  1 2017-… 14:06 Grus-Stein             63.1          7.27  31.3 Vegetasjo…
##  2 2017-… 14:23 Sand                   63.1          7.30  36.9 Vegetasjo…
##  3 2017-… 15:25 Grus                   62.5          5.94  31.6 Grusbunn …
##  4 2017-… 15:37 Skjellsand             62.5          5.94  25.3 Vegetasjo…
##  5 2017-… 15:55 Skjellsand             62.5          5.92  32.7 Vegetasjo…
##  6 2017-… 12:05 Stein                  62.6          5.88  43.3 Steinbunn…
##  7 2017-… 13:14 Stein                  62.4          5.43  37.4 Steinbunn…
##  8 2017-… 16:00 Sand-Stein             63.5          8.12  31.3 Sandbunn …
##  9 2017-… 15:43 Fjell                  63.5          8.12  33.1 Vegetasjo…
## 10 2017-… 16:17 Fjell-Stein            63.5          8.08  32.8 Vegetasjo…
## # ... with 239 more rows, and 18 more variables: Kelp_Percentage <dbl>,
## #   API <chr>, lat <dbl>, lon <dbl>, datatype <chr>, file <chr>,
## #   lang <chr>, dst <dbl>, refcode <chr>, fromtime <chr>, totime <chr>,
## #   interval <dbl>, URL <chr>, API_GET <list>, API_Data <list>,
## #   API_GAM <list>, Water_Level <dbl>, Corrected_Depth <dbl>

Looks like we have everything we wanted! Now let’s clean up the data (remove some of the now-unnecessary columns) and make it ready to store in an SQLite database. We also change the Date column back into a “proper” date column.

data_final <- data_API %>% 
  select(-API, -lat, -lon, -datatype, -file, -lang, -dst, -refcode, -fromtime, 
         -totime, -interval, -URL, -API_GET, -API_Data, -API_GAM) %>% 
  mutate(Date = ymd(Date))

data_final
## # A tibble: 249 x 10
##    Date       Time  Substrate_Type GPS_Latitude GPS_Longitude Depth
##    <date>     <chr> <chr>                 <dbl>         <dbl> <dbl>
##  1 2017-04-03 14:06 Grus-Stein             63.1          7.27  31.3
##  2 2017-04-03 14:23 Sand                   63.1          7.30  36.9
##  3 2017-04-11 15:25 Grus                   62.5          5.94  31.6
##  4 2017-04-11 15:37 Skjellsand             62.5          5.94  25.3
##  5 2017-04-11 15:55 Skjellsand             62.5          5.92  32.7
##  6 2017-04-12 12:05 Stein                  62.6          5.88  43.3
##  7 2017-06-23 13:14 Stein                  62.4          5.43  37.4
##  8 2017-04-06 16:00 Sand-Stein             63.5          8.12  31.3
##  9 2017-04-06 15:43 Fjell                  63.5          8.12  33.1
## 10 2017-04-06 16:17 Fjell-Stein            63.5          8.08  32.8
## # ... with 239 more rows, and 4 more variables: Comment <chr>,
## #   Kelp_Percentage <dbl>, Water_Level <dbl>, Corrected_Depth <dbl>
data_final %>% summary()
##       Date                Time           Substrate_Type    
##  Min.   :2017-04-03   Length:249         Length:249        
##  1st Qu.:2017-04-05   Class :character   Class :character  
##  Median :2017-04-10   Mode  :character   Mode  :character  
##  Mean   :2017-04-19                                        
##  3rd Qu.:2017-04-25                                        
##  Max.   :2017-06-23                                        
##                                                            
##   GPS_Latitude   GPS_Longitude       Depth         Comment         
##  Min.   :62.04   Min.   :5.264   Min.   : 2.70   Length:249        
##  1st Qu.:62.23   1st Qu.:5.446   1st Qu.:10.00   Class :character  
##  Median :62.52   Median :6.091   Median :18.20   Mode  :character  
##  Mean   :62.66   Mean   :6.565   Mean   :19.73                     
##  3rd Qu.:63.06   3rd Qu.:7.724   3rd Qu.:29.50                     
##  Max.   :63.50   Max.   :8.248   Max.   :43.30                     
##                                                                    
##  Kelp_Percentage   Water_Level       Corrected_Depth  
##  Min.   :  0.00   Min.   :-0.01121   Min.   : 0.9898  
##  1st Qu.:  0.00   1st Qu.: 0.75854   1st Qu.: 9.2559  
##  Median : 10.00   Median : 1.00354   Median :17.3076  
##  Mean   : 42.31   Mean   : 1.10819   Mean   :18.6207  
##  3rd Qu.:100.00   3rd Qu.: 1.65164   3rd Qu.:28.6319  
##  Max.   :100.00   Max.   : 2.12296   Max.   :41.2924  
##  NA's   :6

Finally, we can write the corrected, final data as an Excel-compatible .csv file.

data_final %>% 
  write_excel_csv(path = "../Data/data_final.csv")

3 Calculate average, minimum and maximum water level in SQLite

First, we create an in-memory SQLite database and store the final data inside it.

Local_Database <- dbConnect(SQLite(), ":memory:")

Local_Database %>% 
  dbWriteTable("data_final", data_final)

Let’s see if the data is actually in the database. We could also use the dbReadTable() function.

Local_Database %>% 
  dbListTables()
## [1] "data_final"
Local_Database %>% 
  dbListFields("data_final")
##  [1] "Date"            "Time"            "Substrate_Type" 
##  [4] "GPS_Latitude"    "GPS_Longitude"   "Depth"          
##  [7] "Comment"         "Kelp_Percentage" "Water_Level"    
## [10] "Corrected_Depth"

For the final step, we send a query to the database and fetch its response; in this case, the average, minimum, and maximum of Water_Level, which we used above to correct the water depth measurements.

Query <- Local_Database %>% 
  dbSendQuery("SELECT avg(Water_Level), min(Water_Level), max(Water_Level) FROM data_final") 

Query %>% 
  dbFetch()
##   avg(Water_Level) min(Water_Level) max(Water_Level)
## 1         1.108192      -0.01121305         2.122958
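As an aside, the send/fetch cycle can be collapsed into a single call with dbGetQuery(), which also clears the result automatically; handy for one-off queries like this one:

```r
# One-step alternative: dbGetQuery() sends the query, fetches the result,
# and clears it in a single call.
Local_Database %>%
  dbGetQuery("SELECT avg(Water_Level), min(Water_Level), max(Water_Level) FROM data_final")
```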

To tidy things up, we now also clear the result and disconnect from the database.

Query %>% 
  dbClearResult()

Local_Database %>% 
  dbDisconnect()

4 Visualize the sampling points on a map

For visualization, we have to transform the data set into a simple feature (sf) object. These objects store the coordinate reference system alongside the geometries, which allows us to reproject them. Here, we’ll use the Lambert azimuthal equal-area (LAEA) projection for a correct display of the area.

data_final.sf <- data_final %>% 
  st_as_sf(coords = c("GPS_Longitude", "GPS_Latitude"), remove = FALSE, crs = "+proj=longlat +datum=WGS84 +no_defs") %>% 
  st_transform("+proj=laea")

data_final.sf
## Simple feature collection with 249 features and 10 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 318738.1 ymin: 6553545 xmax: 484139.6 ymax: 6698205
## epsg (SRID):    NA
## proj4string:    +proj=laea +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
## # A tibble: 249 x 11
##    Date       Time  Substrate_Type GPS_Latitude GPS_Longitude Depth
##    <date>     <chr> <chr>                 <dbl>         <dbl> <dbl>
##  1 2017-04-03 14:06 Grus-Stein             63.1          7.27  31.3
##  2 2017-04-03 14:23 Sand                   63.1          7.30  36.9
##  3 2017-04-11 15:25 Grus                   62.5          5.94  31.6
##  4 2017-04-11 15:37 Skjellsand             62.5          5.94  25.3
##  5 2017-04-11 15:55 Skjellsand             62.5          5.92  32.7
##  6 2017-04-12 12:05 Stein                  62.6          5.88  43.3
##  7 2017-06-23 13:14 Stein                  62.4          5.43  37.4
##  8 2017-04-06 16:00 Sand-Stein             63.5          8.12  31.3
##  9 2017-04-06 15:43 Fjell                  63.5          8.12  33.1
## 10 2017-04-06 16:17 Fjell-Stein            63.5          8.08  32.8
## # ... with 239 more rows, and 5 more variables: Comment <chr>,
## #   Kelp_Percentage <dbl>, Water_Level <dbl>, Corrected_Depth <dbl>,
## #   geometry <POINT [m]>

Looks like everything is there. We now need a sufficiently good map of Norway. Luckily, the Database of Global Administrative Areas (GADM) offers such a map, even ready for direct use with sf. Once downloaded with download.file(), the file can be read with readRDS() and manipulated like any other sf object. Note that the downloading part has been commented out below.

# download.file(url = "https://biogeo.ucdavis.edu/data/gadm3.6/Rsf/gadm36_NOR_0_sf.rds", 
#               dest = "../Data/gadm36_NOR_0_sf.rds")

NO_shape <- readRDS("../Data/gadm36_NOR_0_sf.rds") %>% 
  st_transform("+proj=laea +datum=WGS84 +no_defs")

NO_shape
## Simple feature collection with 1 feature and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 250389.7 ymin: 6164663 xmax: 1424466 ymax: 7502597
## epsg (SRID):    NA
## proj4string:    +proj=laea +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
##   GID_0 NAME_0                       geometry
## 1   NOR Norway MULTIPOLYGON (((506676.7 61...
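Before combining the two layers in one plot, it is worth confirming that they use the same projection; sf’s st_crs() returns the coordinate reference system of an object. A quick sketch:

```r
# Compare the CRS of both layers; if they differ,
# st_transform() one of them onto the other's CRS
st_crs(NO_shape)
st_crs(data_final.sf)
```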

Now we’re almost good to go. But instead of plotting the sample points on a map of the whole of Norway, we want to “zoom” into the map and get a close-up of the locations. For this, we need to define the corner points of the area covered by the measurement coordinates, and transform them to the LAEA projection.

the_box <- cbind(c(min(data_final.sf$GPS_Longitude), 
                   max(data_final.sf$GPS_Longitude), 
                   max(data_final.sf$GPS_Longitude), 
                   min(data_final.sf$GPS_Longitude), 
                   min(data_final.sf$GPS_Longitude)),
                 c(min(data_final.sf$GPS_Latitude), 
                   min(data_final.sf$GPS_Latitude), 
                   max(data_final.sf$GPS_Latitude), 
                   max(data_final.sf$GPS_Latitude), 
                   min(data_final.sf$GPS_Latitude))) %>% 
  list() %>% 
  st_polygon() %>% 
  st_sfc(crs = "+proj=longlat +datum=WGS84 +no_defs") %>% 
  st_transform(crs = "+proj=laea") %>% 
  st_bbox()

the_box
##      xmin      ymin      xmax      ymax 
##  308144.9 6553119.2  502945.1 6698679.5
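If the points end up sitting too close to the plot edges, the box can be padded with a fixed margin. This is a hypothetical helper (expand_box is not part of sf); it treats the bbox as a plain named numeric vector in projection units, here metres:

```r
# Hypothetical helper: pad a bbox by a fixed margin on every side
expand_box <- function(box, margin = 10000) {
  # bbox order is xmin, ymin, xmax, ymax
  unclass(box) + c(-margin, -margin, margin, margin)
}

expand_box(the_box)
```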

We now have everything we need in place to visualize the data.

NO_shape %>% 
  ggplot() +
  geom_sf() +
  geom_sf(mapping = aes(color = Corrected_Depth, alpha = 1 / Corrected_Depth), 
          data = data_final.sf %>% arrange(Corrected_Depth), size = 1) +
  coord_sf(xlim = c(the_box[1], the_box[3]), 
           ylim = c(the_box[2], the_box[4])) +
  scale_color_gradient(low = "yellow", high = "red") +
  scale_alpha_continuous(range = c(0.2, 1), guide = "none") +
  labs(title = expression(bold("Position of sampling points")), 
       subtitle = "Showing the tide corrected water depth", 
       x = expression(bold("Longitude")), 
       y = expression(bold("Latitude")), 
       color = "Depth", 
       caption = "Data: NIVA") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "cadetblue1"), 
        panel.grid = element_line(color = "white", size = 1))

The plot is saved as follows:

ggsave("../Figures/Final_plot.png", height = 3.5, units = "in")
## Saving 7 x 3.5 in image

As an experimental bonus, we add an interactive graphic using the plotly package. Note that we don’t load the package (library(plotly)), as this would mask dplyr’s filter() function; instead, we call ggplotly() directly from the plotly package.

Final_Plot <- NO_shape %>% 
  ggplot() +
  geom_sf() +
  geom_sf(mapping = aes(color = Corrected_Depth, alpha = 1 / Corrected_Depth, 
                        text = paste0("Longitude: ", round(GPS_Longitude, digits = 3), " ° E\n", 
                                      "Latitude: ", round(GPS_Latitude, digits = 3), " ° N\n", 
                                      "Depth: ", round(Corrected_Depth, digits = 2), " m")), 
          data = data_final.sf %>% arrange(Corrected_Depth), size = 1) +
  coord_sf(xlim = c(the_box[1], the_box[3]), 
           ylim = c(the_box[2], the_box[4])) +
  scale_color_gradient(low = "yellow", high = "red") +
  scale_alpha_continuous(range = c(0.5, 1), guide = "none") +
  labs(title = "Position of Sampling Points", 
       x = "Longitude", 
       y = "Latitude", 
       color = "Depth") +
  theme_bw() +
  theme(panel.background = element_rect(fill = "cadetblue1"), 
        panel.grid = element_line(color = "white", size = 0.5))
## Warning: Ignoring unknown aesthetics: text
plotly::ggplotly(Final_Plot, tooltip = c("text"))
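If the interactive version should be kept alongside the static PNG, it can be written to a standalone HTML file with the htmlwidgets package (the file name here is just an example):

```r
# Save the interactive plot as a self-contained HTML file
plotly::ggplotly(Final_Plot, tooltip = c("text")) %>% 
  htmlwidgets::saveWidget("Final_plot.html", selfcontained = TRUE)
```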